Exploratory Analysis of Spatial Data: Spatial Autocorrelation

In this notebook we introduce methods of exploratory spatial data analysis that are intended to complement geovizualization through formal univariate and multivariate statistical tests for spatial clustering.

Imports



In [ ]:

    
import sys
import os
sys.path.append(os.path.abspath('..'))
from pysal.explore import esda



In [ ]:

    
import pandas as pd
import geopandas as gpd
import pysal.lib as lps
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Our data set comes from the Berlin airbnb scrape taken in April 2018. This dataframe was constructed as part of the GeoPython 2018 workshop by Levi Wolf and Serge Rey. As part of the workshop a geopandas data frame was constructed with one of the columns reporting the median listing price of units in each neighborhood in Berlin:



In [ ]:

    
df = gpd.read_file('data/neighborhoods.shp')
# was created in previous notebook with df.to_file('data/neighborhoods.shp')



In [ ]:

    
df.head()

We have an nan to first deal with:



In [ ]:

    
pd.isnull(df['median_pri']).sum()



In [ ]:

    
df = df
df['median_pri'].fillna((df['median_pri'].mean()), inplace=True)



In [ ]:

    
df.plot(column='median_pri')



In [ ]:

    
fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
df.plot(column='median_pri', scheme='Quantiles', k=5, cmap='GnBu', legend=True, ax=ax)
#ax.set_xlim(150000, 160000)
#ax.set_ylim(208000, 215000)

Spatial Autocorrelation

Visual inspection of the map pattern for the prices allows us to search for spatial structure. If the spatial distribution of the prices was random, then we should not see any clustering of similar values on the map. However, our visual system is drawn to the darker clusters in the south west as well as the center, and a concentration of the lighter hues (lower prices) in the north central and south east.

Our brains are very powerful pattern recognition machines. However, sometimes they can be too powerful and lead us to detect false positives, or patterns where there are no statistical patterns. This is a particular concern when dealing with visualization of irregular polygons of differning sizes and shapes.

The concept of spatial autocorrelation relates to the combination of two types of similarity: spatial similarity and attribute similarity. Although there are many different measures of spatial autocorrelation, they all combine these two types of simmilarity into a summary measure.

Let's use PySAL to generate these two types of similarity measures.

Spatial Similarity

We have already encountered spatial weights in a previous notebook. In spatial autocorrelation analysis, the spatial weights are used to formalize the notion of spatial similarity. As we have seen there are many ways to define spatial weights, here we will use queen contiguity:



In [ ]:

    
wq =  lps.weights.Queen.from_dataframe(df)
wq.transform = 'r'

Attribute Similarity

So the spatial weight between neighborhoods $i$ and $j$ indicates if the two are neighbors (i.e., geographically similar). What we also need is a measure of attribute similarity to pair up with this concept of spatial similarity. The spatial lag is a derived variable that accomplishes this for us. For neighborhood $i$ the spatial lag is defined as: $$ylag_i = \sum_j w_{i,j} y_j$$



In [ ]:

    
y = df['median_pri']
ylag = lps.weights.lag_spatial(wq, y)



In [ ]:

    
ylag



In [ ]:

    
from pysal.viz import mapclassify as mc
ylagq5 = mc.Quantiles(ylag, k=5)



In [ ]:

    
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=ylagq5.yb).plot(column='cl', categorical=True, \
        k=5, cmap='GnBu', linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.title("Spatial Lag Median Price (Quintiles)")

plt.show()

The quintile map for the spatial lag tends to enhance the impression of value similarity in space. It is, in effect, a local smoother.



In [ ]:

    
df['lag_median_pri'] = ylag
f,ax = plt.subplots(1,2,figsize=(2.16*4,4))
df.plot(column='median_pri', ax=ax[0], edgecolor='k',
        scheme="quantiles",  k=5, cmap='GnBu')
ax[0].axis(df.total_bounds[np.asarray([0,2,1,3])])
ax[0].set_title("Price")
df.plot(column='lag_median_pri', ax=ax[1], edgecolor='k',
        scheme='quantiles', cmap='GnBu', k=5)
ax[1].axis(df.total_bounds[np.asarray([0,2,1,3])])
ax[1].set_title("Spatial Lag Price")
ax[0].axis('off')
ax[1].axis('off')
plt.show()

However, we still have the challenge of visually associating the value of the prices in a neighborhod with the value of the spatial lag of values for the focal unit. The latter is a weighted average of homicide rates in the focal county's neighborhood.

To complement the geovisualization of these associations we can turn to formal statistical measures of spatial autocorrelation.

Global Spatial Autocorrelation

We begin with a simple case where the variable under consideration is binary. This is useful to unpack the logic of spatial autocorrelation tests. So even though our attribute is a continuously valued one, we will convert it to a binary case to illustrate the key concepts:

Binary Case



In [ ]:

    
y.median()



In [ ]:

    
yb = y > y.median()
sum(yb)

We have 68 neighborhoods with list prices above the median and 70 below the median (recall the issue with ties).



In [ ]:

    
yb = y > y.median()
labels = ["0 Low", "1 High"]
yb = [labels[i] for i in 1*yb] 
df['yb'] = yb

The spatial distribution of the binary variable immediately raises questions about the juxtaposition of the "black" and "white" areas.



In [ ]:

    
fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
df.plot(column='yb', cmap='binary', edgecolor='grey', legend=True, ax=ax)

Join counts

One way to formalize a test for spatial autocorrelation in a binary attribute is to consider the so-called joins. A join exists for each neighbor pair of observations, and the joins are reflected in our binary spatial weights object wq.

Each unit can take on one of two values "Black" or "White", and so for a given pair of neighboring locations there are three different types of joins that can arise:

Black Black (BB)
White White (WW)
Black White (or White Black) (BW)

Given that we have 68 Black polygons on our map, what is the number of Black Black (BB) joins we could expect if the process were such that the Black polygons were randomly assigned on the map? This is the logic of join count statistics.

We can use the esda package from PySAL to carry out join count analysis:



In [ ]:

    
from pysal.explore import esda 
yb = 1 * (y > y.median()) # convert back to binary
wq =  lps.weights.Queen.from_dataframe(df)
wq.transform = 'b'
np.random.seed(12345)
jc = esda.join_counts.Join_Counts(yb, wq)

The resulting object stores the observed counts for the different types of joins:



In [ ]:

    
jc.bb



In [ ]:

    
jc.ww



In [ ]:

    
jc.bw

Note that the three cases exhaust all possibilities:



In [ ]:

    
jc.bb + jc.ww + jc.bw

and



In [ ]:

    
wq.s0 / 2

which is the unique number of joins in the spatial weights object.

Our object tells us we have observed 121 BB joins:



In [ ]:

    
jc.bb

The critical question for us, is whether this is a departure from what we would expect if the process generating the spatial distribution of the Black polygons were a completely random one? To answer this, PySAL uses random spatial permutations of the observed attribute values to generate a realization under the null of complete spatial randomness (CSR). This is repeated a large number of times (999 default) to construct a reference distribution to evaluate the statistical significance of our observed counts.

The average number of BB joins from the synthetic realizations is:



In [ ]:

    
jc.mean_bb

which is less than our observed count. The question is whether our observed value is so different from the expectation that we would reject the null of CSR?



In [ ]:

    
import seaborn as sbn
sbn.kdeplot(jc.sim_bb, shade=True)
plt.vlines(jc.bb, 0, 0.075, color='r')
plt.vlines(jc.mean_bb, 0,0.075)
plt.xlabel('BB Counts')

The density portrays the distribution of the BB counts, with the black vertical line indicating the mean BB count from the synthetic realizations and the red line the observed BB count for our prices. Clearly our observed value is extremely high. A pseudo p-value summarizes this:



In [ ]:

    
jc.p_sim_bb

Since this is below conventional significance levels, we would reject the null of complete spatial randomness in favor of spatial autocorrelation in market prices.

Continuous Case

The join count analysis is based on a binary attribute, which can cover many interesting empirical applications where one is interested in presence and absence type phenomena. In our case, we artificially created the binary variable, and in the process we throw away a lot of information in our originally continuous attribute. Turning back to the original variable, we can explore other tests for spatial autocorrelation for the continuous case.

First, we transform our weights to be row-standardized, from the current binary state:



In [ ]:

    
wq.transform = 'r'



In [ ]:

    
y = df['median_pri']

Moran's I is a test for global autocorrelation for a continuous attribute:



In [ ]:

    
np.random.seed(12345)
mi = esda.moran.Moran(y, wq)
mi.I

Again, our value for the statistic needs to be interpreted against a reference distribution under the null of CSR. PySAL uses a similar approach as we saw in the join count analysis: random spatial permutations.



In [ ]:

    
import seaborn as sbn
sbn.kdeplot(mi.sim, shade=True)
plt.vlines(mi.I, 0, 1, color='r')
plt.vlines(mi.EI, 0,1)
plt.xlabel("Moran's I")

Here our observed value is again in the upper tail, although visually it does not look as extreme relative to the binary case. Yet, it is still statistically significant:



In [ ]:

    
mi.p_sim

Local Autocorrelation: Hot Spots, Cold Spots, and Spatial Outliers

In addition to the Global autocorrelation statistics, PySAL has many local autocorrelation statistics. Let's compute a local Moran statistic for the same d



In [ ]:

    
np.random.seed(12345)
from pysal.explore import esda



In [ ]:

    
wq.transform = 'r'
lag_price = lps.weights.lag_spatial(wq, df['median_pri'])



In [ ]:

    
price = df['median_pri']
b, a = np.polyfit(price, lag_price, 1)
f, ax = plt.subplots(1, figsize=(9, 9))

plt.plot(price, lag_price, '.', color='firebrick')

 # dashed vert at mean of the price
plt.vlines(price.mean(), lag_price.min(), lag_price.max(), linestyle='--')
 # dashed horizontal at mean of lagged price 
plt.hlines(lag_price.mean(), price.min(), price.max(), linestyle='--')

# red line of best fit using global I as slope
plt.plot(price, a + b*price, 'r')
plt.title('Moran Scatterplot')
plt.ylabel('Spatial Lag of Price')
plt.xlabel('Price')
plt.show()

Now, instead of a single $I$ statistic, we have an array of local $I_i$ statistics, stored in the .Is attribute, and p-values from the simulation are in p_sim.



In [ ]:

    
li = esda.moran.Moran_Local(y, wq)



In [ ]:

    
li.q

We can again test for local clustering using permutations, but here we use conditional random permutations (different distributions for each focal location)



In [ ]:

    
(li.p_sim < 0.05).sum()

We can distinguish the specific type of local spatial association reflected in the four quadrants of the Moran Scatterplot above:



In [ ]:

    
sig = li.p_sim < 0.05
hotspot = sig * li.q==1
coldspot = sig * li.q==3
doughnut = sig * li.q==2
diamond = sig * li.q==4



In [ ]:

    
spots = ['n.sig.', 'hot spot']
labels = [spots[i] for i in hotspot*1]



In [ ]:

    
df = df
from matplotlib import colors
hmap = colors.ListedColormap(['red', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()



In [ ]:

    
spots = ['n.sig.', 'cold spot']
labels = [spots[i] for i in coldspot*1]



In [ ]:

    
df = df
from matplotlib import colors
hmap = colors.ListedColormap(['blue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()



In [ ]:

    
spots = ['n.sig.', 'doughnut']
labels = [spots[i] for i in doughnut*1]



In [ ]:

    
df = df
from matplotlib import colors
hmap = colors.ListedColormap(['lightblue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()



In [ ]:

    
spots = ['n.sig.', 'diamond']
labels = [spots[i] for i in diamond*1]



In [ ]:

    
df = df
from matplotlib import colors
hmap = colors.ListedColormap(['pink', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()



In [ ]:

    
sig = 1 * (li.p_sim < 0.05)
hotspot = 1 * (sig * li.q==1)
coldspot = 3 * (sig * li.q==3)
doughnut = 2 * (sig * li.q==2)
diamond = 4 * (sig * li.q==4)
spots = hotspot + coldspot + doughnut + diamond
spots



In [ ]:

    
spot_labels = [ '0 ns', '1 hot spot', '2 doughnut', '3 cold spot', '4 diamond']
labels = [spot_labels[i] for i in spots]



In [ ]:

    
from matplotlib import colors
hmap = colors.ListedColormap([ 'lightgrey', 'red', 'lightblue', 'blue', 'pink'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()

S-maup: Statistical test to measure the sensitivity to the modifiable areal unit problem

This is a nonparametric statistical test to measure the sensitivity of a spatially intensive variable to the effects of the Modifiable Areal Unit Problem (MAUP). The S-maup statistic can be used to find the maximum level of spatial aggregation (k) that preserves the distributional characteristics of the spatial variable before being aggregated.

H0: The variable y is not significantly affected by the MAUP.

H1: The variable y is significantly affected by the MAUP.

Citation: Duque JC, Laniado H, Polo A (2018) S-maup: Statistical test to measure the sensitivity to the modifiable areal unit problem. PLoS ONE 13(11): e0207377. https://doi.org/10.1371/journal.pone.0207377

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0207377

Let's find the maximum number of regions (k) in which we can aggregate (n) areas before having the negative consequences of the MAUP

1. load the map



In [ ]:

    
#import pysal.lib
#from pysal.explore.esda.moran import Moran
#from pysal.explore.esda.smaup import Smaup

2. select the variable to analyze



In [ ]:

    
#f = pysal.lib.io.open(pysal.lib.examples.get_path("stl_hom.txt"))
#y = np.array(f.by_col['HR8893'])

3. calculate the level of spatial autocorrelation



In [ ]:

    
#w = pysal.lib.io.open(pysal.lib.examples.get_path("stl.gal")).read()
#rho = Moran(y,  w).I

4. calculate the number of areas (n)



In [ ]:

    
#n = len(y)

5. let's begin with the minimum level of aggregation which is the number of areas minus one



In [ ]:

    
'''
k = len(y)-1
crit_val0_05 = np.inf
smaup = -np.inf
#the reduce k until H0 is rejected 
while smaup < crit_val0_05:
    s = Smaup(n,k,rho)
    smaup = s.smaup
    crit_val0_05 = s.critical_05
    k -= 1
'''

6. maximum number of regions (k) is...



In [ ]:

    
#print(k+1)